About our data:
## # A tibble: 4 x 2
## publish_country mean_videos
## <chr> <dbl>
## 1 CANADA 199.
## 2 FRANCE 199.
## 3 GB 190.
## 4 US 200.
Our YouTube data set has 161,470 records and 17 variables. The variables in this data set are: “video_id”, “trending_date”, “title”, “channel_title”, “category_id”, “publish_date”, “time_frame”, “published_day_of_week”, “publish_country”, “tags”, “views”, “likes”, “dislikes”, “comment_count”, “comments_disabled”, “ratings_disabled”, and “video_error_or_removed”. The variables have various data types that we will use in our analysis, including character, integer, logical (boolean), and factor. Each row in the data set is a particular trending video on a specific trending date. Four separate countries are analyzed: the United States, Canada, Great Britain, and France. Each country has its own list of trending videos on each day. The trending videos are taken from November 2017 to June 2018. According to Google (the owner of YouTube), the trending list is updated approximately every 15 minutes (Citation Needed). Thus, the number of videos that trend throughout a day fluctuates. The number of videos on the trending list at any given time is around 200 in each country.
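As a rough illustration of how a per-country figure like the “around 200” above can be obtained, the sketch below counts rows per country and trending date and then averages those daily counts. The `youtube_toy` data frame and its values are invented for the example; only the column names come from our data, and the code is written in base R so that it runs without extra packages.

```r
# Toy stand-in for the real data: one row per video, trending date, and country.
# (Hypothetical values; only the column names match our data set.)
youtube_toy <- data.frame(
  video_id        = c("a", "b", "c", "a", "b", "d", "e"),
  trending_date   = as.Date(c("2018-04-01", "2018-04-01", "2018-04-01",
                              "2018-04-02", "2018-04-02",
                              "2018-04-01", "2018-04-02")),
  publish_country = c("US", "US", "US", "US", "US", "GB", "GB"),
  stringsAsFactors = FALSE
)

# Step 1: count the rows for each country on each trending date.
daily_counts <- aggregate(
  x   = list(n_videos = youtube_toy$video_id),
  by  = list(publish_country = youtube_toy$publish_country,
             trending_date   = youtube_toy$trending_date),
  FUN = length
)

# Step 2: average the daily counts within each country.
mean_videos <- aggregate(
  x   = list(mean_videos = daily_counts$n_videos),
  by  = list(publish_country = daily_counts$publish_country),
  FUN = mean
)
print(mean_videos)
```

In the toy data the US has 3 and 2 trending rows on the two dates, so its mean is 2.5, while GB has 1 on each date.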
Our questions focus on five main areas:
For the most part, our data set is quite user-friendly. When the data is loaded, R automatically assigns a data type to each variable. However, some of the automatically assigned types were not helpful for later analysis, so we changed them.
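The kind of type changes described above might look like the following sketch. The `raw` data frame, its string formats (ISO dates, "True"/"False" flags, numbers stored as text), and the choice of which columns to convert are assumptions made for illustration, not a record of the exact conversions we applied.

```r
# Hypothetical raw import in which every column arrived as character.
raw <- data.frame(
  trending_date     = c("2017-11-14", "2017-11-15"),
  publish_country   = c("US", "GB"),
  comments_disabled = c("False", "True"),
  views             = c("1200", "3400"),
  stringsAsFactors  = FALSE
)

clean <- raw
# Dates as Date objects so they can be ordered and subtracted.
clean$trending_date <- as.Date(clean$trending_date, format = "%Y-%m-%d")
# Countries as a factor so they can be used for grouping and plotting.
clean$publish_country <- factor(clean$publish_country)
# Text flags as logical values (assumes the flags are spelled "True"/"False").
clean$comments_disabled <- clean$comments_disabled == "True"
# Counts as integers.
clean$views <- as.integer(clean$views)

str(clean)
```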
We also had to clean the category_id variable. The raw data assigns a number to each category. Using a published list of the category IDs from YouTube’s API (https://gist.github.com/dgp/1b24bf2961521bd75d6c), we looked up the category name for each number and converted the variable to a factor labeled with those names.
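A minimal sketch of this relabeling, assuming the numeric IDs are stored in a plain vector. The lookup below covers only five of the categories, and both `category_lookup` and `raw_ids` are names introduced for this example.

```r
# Partial lookup from numeric category ID to category name
# (only five categories shown; the real lookup covers every ID in the data).
category_lookup <- c(
  "10" = "Music",
  "22" = "People and Blogs",
  "23" = "Comedy",
  "24" = "Entertainment",
  "25" = "News and Politics"
)

# Hypothetical raw values of category_id.
raw_ids <- c(24, 10, 25, 24, 22, 23)

# Turn the numbers into a factor labeled with the category names.
# Any ID missing from the lookup would become NA.
category_id <- factor(raw_ids,
                      levels = as.integer(names(category_lookup)),
                      labels = unname(category_lookup))

table(category_id)
```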
Our data consists of 18 different video categories. These categories are broad and range from “Pets and Animals” to “Music” to “News and Politics”. The graph below shows the number of trending videos in each category in our data.
Thus, we can see that “Entertainment” videos are the most frequent type of videos that appear on the trending list. It is important to note that each video can only be listed under one category.
We can also see how the number of trending videos in each category changes over time. Our data consists of trending videos from November 2017 through June 2018. Figure 2 below shows the number of trending videos over time in the categories “Entertainment,” “Music,” “People and Blogs,” “Comedy,” and “News and Politics”. These are the five categories with the most trending videos, as shown above. We aim to answer the following question:
How has the number of trending videos for different categories changed over time?
In the graph above, the number of trending “Entertainment” videos stays relatively constant over time. “Music” videos increase starting around March, and the increase continues into May. One plausible explanation is that many artists release music during this time (so a song can become popular before summertime while still being considered new). “Comedy,” “People and Blogs,” and “News and Politics” stay constant until “Music” begins to increase; during that increase, these three categories decrease.
## # A tibble: 2 x 2
## category_id rate
## <fct> <dbl>
## 1 Entertainment 0.00385
## 2 Music 0.00216
## # A tibble: 10 x 2
## trending_date nbr_trending_videos
## <date> <int>
## 1 2018-04-01 780
## 2 2018-04-02 780
## 3 2018-04-03 789
## 4 2018-04-04 786
## 5 2018-04-05 792
## 6 2018-04-06 798
## 7 2018-04-07 794
## 8 2018-04-14 790
## 9 2018-04-15 785
## 10 2018-04-16 781
We would expect that the more views a video has, the more likes it has, and the upward trends confirm this. The points fan out quite a bit in each of the graphs above. In Canada and France, the association between views and likes is positive and linear. In Great Britain and the US, the association is still positive, but the slope is shallower: the number of likes does not rise as quickly with the number of views as it does in Canada and France. We cannot be sure that Canada and France would behave differently at higher view counts, however, because Great Britain and the US have the most videos with the largest numbers of views. If the other countries had more highly viewed videos, we might see a similar trend.
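One way to put numbers on the slopes compared above is to fit a separate straight line of likes on views for each country. The sketch below assumes a simple linear fit; the `toy` data and its slopes (0.05 and 0.02) are invented for illustration and are not estimates from our data.

```r
# Invented data: likes are an exact multiple of views within each country.
toy <- data.frame(
  publish_country = rep(c("CANADA", "US"), each = 4),
  views           = rep(c(1000, 2000, 3000, 4000), times = 2),
  stringsAsFactors = FALSE
)
toy$likes <- ifelse(toy$publish_country == "CANADA",
                    0.05 * toy$views,   # steeper line
                    0.02 * toy$views)   # shallower line

# Fit likes ~ views separately for each country and keep the slope.
slopes <- sapply(split(toy, toy$publish_country), function(d) {
  coef(lm(likes ~ views, data = d))[["views"]]
})
print(slopes)
```

A larger slope means that likes rise faster as views rise, which is the comparison made between Canada and France on one hand and Great Britain and the US on the other.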
As in the previous figure, we see a positive and fairly linear trend, but now looking only at the music category and at dislikes rather than likes. One video published in Great Britain is an outlier with nearly 300,000 dislikes, so it was removed to make the graphs more interpretable. Great Britain has the shallowest slope: its more-viewed music videos tend to be disliked less than those in the other three countries. As with views and likes, the association between views and dislikes is positive, so in the music category, videos with more views tend to have more dislikes.
One music video published in the US has nearly 600,000 comments and 4,000,000 views; it was removed to make the graph above more interpretable. In the music category, a higher number of likes is associated with a higher number of comments. The trend is similar in each of the four countries: the fitted lines are positively sloped and fairly linear, and very few outliers stray from the pattern.
## # A tibble: 5 x 2
## category_id median_days_return
## <fct> <drtn>
## 1 Music 31 days
## 2 People and Blogs 31 days
## 3 Comedy 31 days
## 4 Entertainment 31 days
## 5 News and Politics 31 days
Some videos appeared on the trending list on multiple dates that were not consecutive. This plot excludes consecutive trending days. If a video leaves the trending list, it is likely to come back within only a couple of days. However, there are quite a few outliers, as shown on the boxplot. Some videos in the music category even took about a month to return to the trending list. Videos in the music category have the largest spread of time between trending dates. Videos in the news and politics category have the smallest spread and the least time between trending dates. The top 4 categories on this boxplot have a median of 3 days before returning to trending. In the News and Politics category, the median number of days is 2.
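The time before a video returns to the trending list can be sketched as the gap between successive trending dates of the same video, keeping only gaps longer than one day. The `toy` data and the `gap_days` column are hypothetical, and this is one possible way to compute the quantity rather than the exact code behind the boxplot.

```r
# Invented trending history for two videos.
toy <- data.frame(
  video_id      = c("a", "a", "a", "a", "b", "b"),
  trending_date = as.Date(c("2018-04-01", "2018-04-02", "2018-04-05",
                            "2018-04-06", "2018-04-01", "2018-04-03")),
  stringsAsFactors = FALSE
)

# Sort so that each video's dates are in chronological order.
toy <- toy[order(toy$video_id, toy$trending_date), ]

# Days since the same video's previous trending date (NA for its first date).
toy$gap_days <- ave(as.numeric(toy$trending_date), toy$video_id,
                    FUN = function(d) c(NA, diff(d)))

# A gap of 1 is a consecutive day; anything larger is a return to the list.
returns <- toy[!is.na(toy$gap_days) & toy$gap_days > 1, ]
median_days_return <- median(returns$gap_days)
print(returns)
```

Here video "a" returns after 3 days and video "b" after 2 days, so the median gap in the toy data is 2.5 days.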
## # A tibble: 4 x 2
## publish_country median_days_trending
## <chr> <dbl>
## 1 CANADA 1
## 2 FRANCE 1
## 3 GB 10
## 4 US 7
The above plot shows the number of days videos are trending in each of the four publish countries using the proportion of videos. This graph is also filtered for the number of days trending as less than 100 because there were some very large outliers that made the graphs very difficult to read. As we can see, in Canada and France it appears that the videos they published trend for shorter periods of time - usually less than a week. In Great Britain and the US, however, the videos published tend to trend for longer. While it appears that the majority of these videos trend for two weeks or less, Great Britain videos can trend up to about 50 days, and US videos can trend up to about 60 days. In every country other than the US, the largest proportion of videos trend for one day. In the US, however, there is a spike of the greatest proportion at about one week. The median number of days videos are trending in Canada and France is 1. In Great Britain it is 10, which we see from the curve not tapering off as quickly. In the US, videos trend for a median of 7 days.
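The medians quoted above could be computed along the following lines: count the distinct trending dates for each video, drop the extreme values, and take the median within each publish country. The `toy` data is invented, and treating "days trending" as the number of distinct trending dates per video is an assumption of this sketch.

```r
# Invented data: one row per video and trending date.
toy <- data.frame(
  video_id        = c("a", "a", "a", "b", "c", "c", "d"),
  publish_country = c("US", "US", "US", "US", "GB", "GB", "GB"),
  trending_date   = as.Date(c("2018-04-01", "2018-04-02", "2018-04-03",
                              "2018-04-01", "2018-04-01", "2018-04-02",
                              "2018-04-05")),
  stringsAsFactors = FALSE
)

# Keep one row per video and date, then count the dates for each video.
toy <- unique(toy)
days_trending <- aggregate(
  x   = list(days_trending = toy$video_id),
  by  = list(video_id = toy$video_id,
             publish_country = toy$publish_country),
  FUN = length
)

# Same filter as the plot: drop videos that trended for 100 days or more.
days_trending <- days_trending[days_trending$days_trending < 100, ]

# Median number of trending days within each publish country.
median_days <- aggregate(
  x   = list(median_days_trending = days_trending$days_trending),
  by  = list(publish_country = days_trending$publish_country),
  FUN = median
)
print(median_days)
```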
Here we have a summary of some important metrics in the data. France has the largest number of videos in four of the top five most common categories by instances. Canada has the largest number of entertainment videos. There are also metrics for the mean number of likes, dislikes, comments, and views in each of the four countries.
library(dplyr)
library(reactable)

# Summary metrics for the top five categories, counting each video once.
youtube_table_cat <- youtube %>%
  distinct(video_id, .keep_all = TRUE) %>%
  filter(category_id %in% c("Entertainment", "Music", "People and Blogs",
                            "Comedy", "News and Politics")) %>%
  group_by(category_id) %>%
  summarise(num_videos = n(),
            mean_views = round(mean(views)),
            mean_likes = round(mean(likes)),
            mean_dislikes = round(mean(dislikes)),
            mean_comments = round(mean(comment_count)))

# Render the summary as an interactive table with readable column names.
reactable(youtube_table_cat, bordered = TRUE, striped = TRUE, columns = list(
  category_id = colDef(name = "Category"),
  num_videos = colDef(name = "Number of Videos"),
  mean_views = colDef(name = "Mean Number of Views"),
  mean_comments = colDef(name = "Mean Number of Comments"),
  mean_likes = colDef(name = "Mean Number of Likes"),
  mean_dislikes = colDef(name = "Mean Number of Dislikes")))
Looking at the top five categories in the data, here we have the number of distinct videos trending in that category, and mean number of views, likes, dislikes, and comments.
# Summary metrics by published day of week, counting each video once.
youtube_table_week <- youtube %>%
  distinct(video_id, .keep_all = TRUE) %>%
  filter(category_id %in% c("Entertainment", "Music", "People and Blogs",
                            "Comedy", "News and Politics")) %>%
  group_by(published_day_of_week) %>%
  summarise(num_music = sum(category_id == "Music"),
            num_entertainment = sum(category_id == "Entertainment"),
            num_peopleblogs = sum(category_id == "People and Blogs"),
            num_comedy = sum(category_id == "Comedy"),
            num_newspolitics = sum(category_id == "News and Politics"),
            mean_views = round(mean(views)),
            mean_likes = round(mean(likes)),
            mean_dislikes = round(mean(dislikes)),
            mean_comments = round(mean(comment_count)))

# Render the summary as an interactive table with readable column names.
reactable(youtube_table_week, bordered = TRUE, striped = TRUE, columns = list(
  published_day_of_week = colDef(name = "Published Day of Week"),
  num_music = colDef(name = "Number of Music Videos"),
  num_entertainment = colDef(name = "Number of Entertainment Videos"),
  num_peopleblogs = colDef(name = "Number of People & Blogs Videos"),
  num_comedy = colDef(name = "Number of Comedy Videos"),
  num_newspolitics = colDef(name = "Number of News & Politics Videos"),
  mean_views = colDef(name = "Mean Number of Views"),
  mean_comments = colDef(name = "Mean Number of Comments"),
  mean_likes = colDef(name = "Mean Number of Likes"),
  mean_dislikes = colDef(name = "Mean Number of Dislikes")))
This table shows important metrics for trending videos in the top five categories, grouped by the day of the week on which they were published. The largest mean numbers of views, likes, and comments belong to videos published on Fridays, while the largest mean number of dislikes belongs to videos published on Sundays.